Beginning to Understand Unstructured, Ungrammatical Text: An Information Integration Approach
Abstract
As information agents become pervasive, they will need to read and understand the vast amount of information on the World Wide Web. One valuable source of such information is the unstructured, ungrammatical text that appears in data sources such as online auctions or internet classifieds. One way to begin to understand this text is to determine which entities it references. This can be framed as the semantic annotation problem, where the goal is to extract the attributes embedded within the text and then annotate the text with these extracted attributes. If enough attributes can be extracted, the entity referenced in the text can be determined. For example, given a classified ad for a used car, identifying the make, model, and year within the post identifies the car for sale. However, information extraction is difficult because the text does not contain reliable structural or grammatical clues. In this paper we present an unsupervised approach to semantically annotating such unstructured and ungrammatical text, with the intention that this will help with the problem of machine understanding on the Web. Furthermore, we define an architecture that allows for better understanding over time. We present experiments showing that our annotation approach is competitive with the state of the art, which uses supervised machine learning, even though our technique is fully unsupervised.
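To make the used-car example concrete, here is a minimal Python sketch of annotating a post with make, model, and year. It is an illustration only and not the paper's algorithm: the small table of known cars (REFERENCE_SET), the fuzzy token matching with difflib, the 0.8 threshold, and the function names are all assumptions introduced for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical table of known entities; assumed for illustration only.
REFERENCE_SET = [
    {"make": "Honda", "model": "Civic"},
    {"make": "Honda", "model": "Accord"},
    {"make": "Toyota", "model": "Corolla"},
]


def best_token_score(value, tokens):
    """Highest fuzzy similarity between one attribute value and any post token."""
    return max(
        (SequenceMatcher(None, value.lower(), tok).ratio() for tok in tokens),
        default=0.0,
    )


def annotate(post, reference_set=REFERENCE_SET, threshold=0.8):
    """Return the attributes of the best-matching entity, plus a year if one is found.

    The threshold is an arbitrary choice for this sketch, not a tuned value.
    """
    tokens = post.lower().split()
    best_row, best_score = None, 0.0
    for row in reference_set:
        # Average the per-attribute similarity so every attribute counts equally.
        score = sum(best_token_score(v, tokens) for v in row.values()) / len(row)
        if score > best_score:
            best_row, best_score = row, score

    annotation = dict(best_row) if best_row and best_score >= threshold else {}

    # A four-digit year starting with 19 or 20 is treated as the model year.
    year = re.search(r"\b(?:19|20)\d{2}\b", post)
    if year and annotation:
        annotation["year"] = year.group(0)
    return annotation


if __name__ == "__main__":
    print(annotate("2002 hnda civic ex 4dr, runs great"))
```

The misspelled token "hnda" still scores highly against "Honda", which is the kind of noise such posts contain. A post with no close match to any known entity yields an empty annotation.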
Similar resources
A Reference-set Approach to Information Extraction from Unstructured, Ungrammatical Data Sources
This thesis investigates information extraction from unstructured, ungrammatical text on the Web such as classified ads, auction listings, and forum postings. Since the data is unstructured and ungrammatical, this information extraction precludes the use of rule-based methods that rely on consistent structures within the text or natural language processing techniques that rely on grammar. Inste...
An Automatic Approach to Semantic Annotation of Unstructured, Ungrammatical Sources: A First Look
There exist numerous sources of data on the World Wide Web that contain useful information but are not structured or grammatical enough to support traditional information extraction. Furthermore, even if the information extraction could be done, the extracted values would need to be standardized to ensure the queries over the source are accurate. This paper presents an automatic, scalable appro...
Creating Relational Data from Unstructured and Ungrammatical Data Sources
In order for agents to act on behalf of users, they will have to retrieve and integrate vast amounts of textual data on the World Wide Web. However, much of the useful data on the Web is neither grammatical nor formally structured, making querying difficult. Examples of these types of data sources are online classifieds like Craigslist and auction item listings like eBay. We call this unstructu...
Constructing Reference Sets from Unstructured, Ungrammatical Text
Vast amounts of text on the Web are unstructured and ungrammatical, such as classified ads, auction listings, forum postings, etc. We call such text “posts.” Despite their inconsistent structure and lack of grammar, posts are full of useful information. This paper presents work on semi-automatically building tables of relational information, called “reference sets,” by analyzing such posts dire...
Exploiting Background Knowledge to Build Reference Sets for Information Extraction
Previous work on information extraction from unstructured, ungrammatical text (e.g. classified ads) showed that exploiting a set of background knowledge, called a “reference set,” greatly improves the precision and recall of the extractions. However, finding a source for this reference set is often difficult, if not impossible. Further, even if a source is found, it might not overlap well with ...